Roughly, it tells us how many examples (and how much computation)
we will need before we can learn a hypothesis $h$ that is
probably approximately correct.
PAC learning results are independent of the learning
algorithm used!
2.1 Concept Learning
Given:
Instances $X$: possible days, each described by the
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
Target function $c$: $EnjoySport: X \rightarrow \{0,1 \}$
Hypotheses $H$: Conjunctions of literals. E.g.
\[\langle ?, Cold, High, ?, ?, ? \rangle. \]
Training examples $D$: Positive and negative examples of the target function
\[\langle x_1, c(x_1) \rangle , \ldots \langle x_m, c(x_m) \rangle\]
Determine:
A hypothesis $h$ in $H$ such that $h(x)=c(x)$ for all $x$
in $D$?
A hypothesis $h$ in $H$ such that $h(x)=c(x)$ for all $x$ in $X$?
2.1.1 Sample Complexity
How many training examples are sufficient to learn the
target concept?
The examples could come in several ways:
The learner proposes instances, as queries to teacher:
Learner proposes instance $x$, teacher provides $c(x)$.
Play 20 questions, divide space in half.
The teacher (who knows $c$) provides training examples:
a sequence of examples of the form $\langle x, c(x) \rangle$.
The best strategy for choosing examples depends on the learner.
There is some random process that proposes the
instances: $x$ is generated randomly and the teacher provides
$c(x)$. Hmmmm, this case needs further analysis.
2.2 Problem Setting
Given a set of instances $X$, set of hypotheses $H$, set
of possible target concepts $C$, and training instances
generated by a fixed, unknown probability distribution
$\cal{D}$ over $X$
Learner observes a sequence $D$ of training examples of
form $\langle x, c(x)\rangle$, for some target concept $c \in
C$
The instances $x$ are drawn from distribution $\cal{D}$
and the teacher provides target value $c(x)$ for each.
Learner must output a hypothesis $h$ estimating $c$. $h$
is evaluated by its performance on subsequent instances drawn
according to $\cal{D}$
2.3 The Error of a Hypothesis
The true error (denoted $error_{\cal{D}}(h)$) of
hypothesis $h$ with respect to target concept $c$ and distribution
$\cal{D}$ is the probability that $h$ will misclassify an instance
drawn at random according to $\cal{D}$. \[error_{\cal{D}}(h) \equiv
\Pr_{x \in \cal{D}}[c(x) \neq h(x)] \]
That is, the probability mass of the instances on which $h$
and $c$ disagree.
The training error of $h$ with respect to $c$
is how often $h$ is wrong on the training data, i.e., the fraction
of the training examples it misclassifies:
\[error_{D}(h) \equiv \Pr_{x \in D}[c(x) \neq h(x)], \]
where the probability is taken over the training set $D$ rather than
the distribution $\cal{D}$.
We will start by assuming a training error of 0.
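As a concrete, illustrative sketch of these two quantities, assuming hypotheses, the target concept, and a sampler for $\cal{D}$ are represented as Python callables (all names here are hypothetical, not from the text):
\begin{verbatim}
# Illustrative sketch: training error on a finite sample D versus a
# Monte Carlo estimate of the true error under distribution D.
# (All names are hypothetical; h and c map an instance to 0 or 1.)

def training_error(h, D):
    """Fraction of labeled examples (x, c_x) in D that h misclassifies."""
    return sum(1 for x, c_x in D if h(x) != c_x) / len(D)

def estimated_true_error(h, c, sample_instance, trials=100000):
    """Monte Carlo estimate of Pr_{x ~ D}[ h(x) != c(x) ]."""
    mistakes = 0
    for _ in range(trials):
        x = sample_instance()      # x drawn according to distribution D
        mistakes += (h(x) != c(x))
    return mistakes / trials
\end{verbatim}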
2.4 PAC Learnability
Given a definition of error, we can now define PAC
learnability.
A concept class $C$ is
PAC-learnable by $L$ using
$H$ if for all $c \in C$, distributions $\cal{D}$ over $X$,
$\epsilon$ such that $0 < \epsilon < 1/2$, and
$\delta$ such that $0 < \delta < 1/2$, learner $L$
will with probability at least $(1-\delta)$ output a
hypothesis $h \in H$ such that $error_{\cal{D}}(h) \leq
\epsilon$, in time that is polynomial in $1/\epsilon$,
$1/\delta$, $n$, and $size(c)$, where $n$ is the size of instances
in $X$ and $size(c)$ is the encoding length of the target concept $c$.
Notice that even though the definition only mentions computation time,
computation time grows with the number of examples needed, so a
polynomial time bound implicitly bounds the number of examples as well.
So, we want to PAC-learn some concept, but when can we do it?
3 Sample Complexity for Finite Hypothesis Spaces
The growth in the number of required training examples
with problem size is called the sample complexity
of the learning problem.
We will consider only consistent learners,
which are those that maintain a training error of 0.
We can derive a bound on the number of training examples
required by any consistent learner!
Fact: Every consistent learner outputs a hypothesis
belonging to the version space.
Therefore, we need to bound the number of examples needed
to assure that the version space contains no unacceptable
hypothesis.
The version space $VS_{H,D}$ is said to be
ε-exhausted with respect to $c$ and
$\cal{D}$, if every hypothesis $h$ in $VS_{H,D}$ has error
less than ε with respect to $c$ and $\cal{D}$.
\[(\forall h \in VS_{H,D}) error_{\cal{D}}(h) < \epsilon \]
3.1 How Many Examples Needed To Exhaust VS?
If the hypothesis space $H$ is finite, and $D$ is a sequence of $m \geq 1$
independent random examples of some target concept $c$, then for any $0 \leq
\epsilon \leq 1$, the probability that the version space with respect to $H$
and $D$ is not ε-exhausted (with respect to $c$) is less than \[|H|
e^{-\epsilon m} \]
...see book for proof.
This bounds the probability that any consistent learner
will output a hypothesis $h$ with $error_{\cal{D}}(h) \geq \epsilon$.
If we want this probability to be below $\delta$,
\[|H|e^{- \epsilon m} \leq \delta \]
then
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(\frac{1}{\delta})) \]
So, any consistent learner can PAC-learn any
target concept in $H$ with $m$ examples, where PAC-learn means
that the hypothesis is probably ($1 - \delta$) approximately
(within error ε) correct.
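The bound follows by taking logarithms of $|H|e^{-\epsilon m} \leq \delta$ and solving for $m$. A minimal sketch of the resulting calculation (the function name is illustrative; the example value $|H| = 973$ is the size of the EnjoySport hypothesis space in Mitchell's textbook example):
\begin{verbatim}
from math import ceil, log

def consistent_sample_bound(H_size, epsilon, delta):
    """Number of examples sufficient so that, with probability at least
    1 - delta, every consistent hypothesis in a finite H has true error
    at most epsilon: m >= (ln|H| + ln(1/delta)) / epsilon."""
    return ceil((log(H_size) + log(1.0 / delta)) / epsilon)

# e.g. |H| = 973, epsilon = 0.1, delta = 0.05  ->  99 examples
print(consistent_sample_bound(973, 0.1, 0.05))
\end{verbatim}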
3.2 Agnostic Learning
So far, we have assumed that $c \in H$.
A learner that does not assume the target
concept is representable by $H$, and that simply finds the
hypothesis with minimum training error, is an agnostic
learner.
We want the hypothesis $h$ that makes fewest errors on
training data.
The sample complexity is
\[ m \geq \frac{1}{2 \epsilon^{2}}(\ln|H| + \ln(1/\delta)) \]
derived from Hoeffding bounds:
\[ \Pr[error_{\cal{D}}(h) > error_D(h) + \epsilon] \leq e^{-2m\epsilon^{2}} \]
where $D$ is a set containing $m$ randomly drawn examples and
$error_D(h)$ is the training error of $h$ on $D$.
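To see where this bound comes from (a standard derivation step, not spelled out above): applying the Hoeffding bound to each of the $|H|$ hypotheses, taking a union bound, and requiring the total failure probability to be at most $\delta$ gives
\[ |H| e^{-2m\epsilon^{2}} \leq \delta
\quad \Longrightarrow \quad
m \geq \frac{1}{2\epsilon^{2}}\left(\ln|H| + \ln\frac{1}{\delta}\right). \]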
3.3 Conjunctions of Boolean Literals are PAC-Learnable
Consider the class $C$ of target concepts described by
conjunctions of boolean literals, e.g. \[Old \wedge \neg
Tall.\]
Is it PAC-learnable? We need to show that any consistent
learner will require only a polynomial number of training
examples to learn any $c \in C$.
How many examples are sufficient to assure with
probability at least $(1 - \delta)$ that every $h$ in
$VS_{H,D}$ satisfies $error_{\cal{D}}(h) \leq \epsilon$?
Use our theorem:
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(\frac{1}{\delta})) \]
Suppose $H$ contains conjunctions of constraints on up to $n$ boolean
attributes (i.e., $n$ boolean literals). Then $|H| = 3^n$, and
\[m \geq \frac{1}{\epsilon}(\ln 3^n + \ln(\frac{1}{\delta})) \]
or
\[m \geq \frac{1}{\epsilon}(n \ln 3 + \ln(\frac{1}{\delta})) \]
So, $m$ grows only linearly in $n$ (and hence polynomially), as required.
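As an illustrative calculation (the numbers are chosen here only for concreteness): with $n = 10$ boolean attributes, $\epsilon = 0.1$, and $\delta = 0.05$,
\[ m \geq \frac{1}{0.1}\left(10 \ln 3 + \ln 20\right) \approx 10\,(10.99 + 3.00) \approx 140, \]
so roughly 140 training examples suffice.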
For example, Find-S is a consistent learner that uses time
linear in $n$ for each example. Therefore, the class of
conjunction of boolean literals is PAC-learnable by the Find-S
algorithm using $H = C$.
3.4 The Unbiased Concept Class is Not PAC-Learnable
Consider the unbiased concept class $C$ that contains every
teachable concept over $X$, i.e., every possible subset of $X$:
\[|C| = 2^{|X|}\]
Let the instances be defined by $n$ boolean features.
\[|X| = 2^n\] so \[|C| = 2^{2^n}\]
To learn this, the learner must use $H = C$.
We then have that
\[ m \geq \frac{1}{\epsilon}(2^n \ln 2 + \ln (\frac{1}{\delta}))
\]
so the number of examples needed according to this bound is exponential in $n$ (strictly speaking, since this is only an upper bound it does not prove the fact by itself, but the fact is true nonetheless).
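For concreteness (illustrative numbers): already with $n = 10$ boolean features the bound gives, for $\epsilon = 0.1$ and $\delta = 0.05$,
\[ m \geq \frac{1}{0.1}\left(2^{10} \ln 2 + \ln 20\right) \approx 10\,(709.8 + 3.0) \approx 7130, \]
and the bound roughly doubles with each additional feature.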
4 Sample Complexity for Infinite Hypothesis Spaces
So far, we have been using $|H|$ in our equation for
sample complexity, which only works for finite $H$.
It can also lead to weak (i.e., too loose/high)
bounds.
We now show how to handle infinite $H$ by using the
Vapnik-Chervonenkis dimension.
4.1 Shattering a Set of Instances
A dichotomy of a set $S$ is a partition of $S$ into two disjoint subsets.
There are $2^{|S|}$ possible dichotomies of an instance set $S$.
A set of instances $S$ is shattered by hypothesis
space $H$ if and only if for every dichotomy of $S$ there
exists some hypothesis in $H$ consistent with that
dichotomy.
The ability of $H$ to shatter a set of instances is a
measure of its capacity to represent target concepts over
them.
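A minimal brute-force sketch of checking whether a finite set of hypotheses shatters a given set of instances, using a toy class of threshold functions on the real line (the class and all names are chosen here purely for illustration):
\begin{verbatim}
from itertools import product

def make_threshold(t):
    # Toy hypothesis: h_t(x) = 1 iff x >= t (illustrative choice).
    return lambda x: int(x >= t)

def shatters(hypotheses, S):
    """True iff every dichotomy (labeling) of S is realized by some hypothesis."""
    for labeling in product([0, 1], repeat=len(S)):
        if not any(all(h(x) == y for x, y in zip(S, labeling))
                   for h in hypotheses):
            return False
    return True

H = [make_threshold(t) for t in (-1.0, 0.5, 1.5, 99.0)]
print(shatters(H, [1.0]))        # True: a single point can be shattered
print(shatters(H, [1.0, 2.0]))   # False: no threshold labels 1.0 positive
                                 # and 2.0 negative
\end{verbatim}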
4.2 The Vapnik-Chervonenkis Dimension
An unbiased hypothesis space $H$ is one that shatters the entire
instance space $X$ (and remember, unbiased is bad).
But, what if $H$ can instead shatter a large subset of
$X$?
The Vapnik-Chervonenkis dimension, $VC(H)$, of
hypothesis space $H$ defined over instance space $X$ is the
size of the largest finite subset of $X$ shattered by $H$. If
arbitrarily large finite sets of $X$ can be shattered by $H$,
then $VC(H) \equiv \infty$.
For any finite $H$, if $VC(H) = d$ then $H$ must contain at least
$2^d$ distinct hypotheses (one for each dichotomy of a shattered set
of size $d$), so
\[ |H| \geq 2^{VC(H)} \quad\mbox{and hence}\quad VC(H) \leq \log_2|H|. \]
4.3 VC Dimension of Linear Decision Surfaces
The VC dimension of linear decision surfaces on the plane
is 3.
Here $H$ is the set of all lines in the plane, each classifying points by which side of the line they fall on.
Every set of 2 distinct points can be shattered. Some sets of 3
points (any 3 non-collinear points) can be shattered, so $VC(H) \geq 3$.
No set of 4 points can be shattered (e.g., for 4 points in convex
position, the labeling that assigns the same class to opposite corners
cannot be realized by any line), so $VC(H) = 3$.
4.4 Sample Complexity from VC Dimension
How many randomly drawn examples suffice to
$\epsilon$-exhaust $VS_{H,D}$ with probability at least $(1 -
\delta)$?
\[
m \geq \frac{1}{\epsilon}(4\log_2(\frac{2}{\delta}) + 8 VC(H) \log_2 (\frac{13}{\epsilon}))
\]
Similar to before but we replaced the $\ln |H|$ with
$VC(H)$ (recall that $VC(H) \leq \log_2|H|$).
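As an illustrative calculation (numbers chosen only for concreteness): for linear decision surfaces in the plane we have $VC(H) = 3$, so with $\epsilon = 0.1$ and $\delta = 0.05$,
\[ m \geq \frac{1}{0.1}\left(4\log_2 40 + 24\log_2 130\right) \approx 10\,(21.3 + 168.5) \approx 1900. \]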
We can also provide a lower bound on sample
complexity. If $VC(C) \geq 2$, $0 < \epsilon <
\frac{1}{8}$, and $0 < \delta < \frac{1}{100}$, then
there exists a distribution $\cal{D}$ and a target concept in $C$
such that if $L$ observes fewer examples than
\[
\max \left( \frac{1}{\epsilon}\log(\frac{1}{\delta}), \frac{VC(C)-1}{32\epsilon} \right)
\]
then with probability at least $\delta$, $L$ outputs a hypothesis $h$
having $error_{\cal{D}}(h) > \epsilon$.
Note that this lower bound is reasonably close to the upper bound above.
4.5 VC Dimension for Neural Nets
Define $C_G$ to be the G-composition of
$C$—the set of all functions that can be implemented by
network $G$ assuming units take on functions from class
$C$.
$C$ is the concept class represented by each neuron.
If $G$ is a layered directed acyclic graph (DAG) with $n$ input nodes
and $s \geq 2$ internal nodes, each having at most $r$ inputs, and
$VC(C) = d$, then
\[ VC(C_G) \leq 2ds \cdot \log(es) \]
We can use this to bound the number of training examples
sufficient to learn (with prob. at least $1-\delta$) any
target concept from $C_G$ to within error ε.
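As an illustrative instantiation (not worked out above): if each internal unit is a linear threshold unit with at most $r$ real-valued inputs, then $d = VC(C) = r+1$, so for a network with $s = 20$ such units, each with at most $r = 10$ inputs,
\[ VC(C_G) \leq 2ds\log(es) = 2 \cdot 11 \cdot 20 \cdot \log(20e) = 440\,\log(20e), \]
and this value can be substituted for $VC(H)$ in the sample-complexity bound above.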
5 Mistake Bound Model of Learning
Computational learning theory studies other models (other than PAC) in which
the order of the training examples is varied,
there is noise in the data,
the definition of success is different,
the learner makes different assumptions about the distribution of instances, etc.
We will now look at the mistake bound model of learning in which the learner is evaluated by the total number of mistakes it makes before it converges to the correct hypothesis.
5.1 Mistake Bound Model: Find-S
Instances drawn at random from $X$ according to
distribution $\cal{D}$
Learner must classify each instance before receiving correct
classification from teacher.
Consider Find-S when $H=$ conjunction of boolean
literals.
Initialize $h$ to the most specific hypothesis $l_{1} \wedge \neg l_{1} \wedge l_{2} \wedge \neg l_{2}
\wedge \ldots \wedge l_{n} \wedge \neg l_{n}$
For each positive training instance $x$, remove from $h$ any literal that is not satisfied by $x$
Output hypothesis $h$.
How many mistakes are made before converging to the correct $h$?
At most $n+1$: the first mistake (necessarily on a positive example,
since the initial hypothesis classifies everything as negative) removes
$n$ of the initial $2n$ literals, and each subsequent mistake removes
at least one more.
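A minimal sketch of this mistake-bound version of Find-S over $n$ boolean attributes (the representation and all names are illustrative): the hypothesis is kept as a set of literals, instances as tuples of 0/1 values, and mistakes can only occur on positive examples.
\begin{verbatim}
# Find-S in the mistake-bound setting over n boolean attributes.
# A literal is a pair (index, value); h is the conjunction of its literals.
# (Representation chosen only for illustration.)

def initial_hypothesis(n):
    # Most specific hypothesis: l_i and not-l_i for every attribute i.
    return {(i, v) for i in range(n) for v in (0, 1)}

def predict(h, x):
    return int(all(x[i] == v for (i, v) in h))

def find_s_online(examples, n):
    h, mistakes = initial_hypothesis(n), 0
    for x, label in examples:
        if predict(h, x) != label:
            mistakes += 1
        if label == 1:
            # Keep only the literals satisfied by the positive example x.
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h, mistakes

# Hypothetical target concept: x[0] AND NOT x[2], with n = 3 attributes.
data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 0, 1), 0)]
print(find_s_online(data, 3))    # makes 2 mistakes here; the bound is n+1 = 4
\end{verbatim}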
5.2 Mistake Bound Model: Halving Algorithm
Consider the Halving Algorithm
Learn the concept using the version-space
Candidate-Elimination algorithm.
Classify new instances by majority vote of version space members.
How many mistakes before converging to correct $h$?
At most $\log_2|H|$: every mistake eliminates at least half
of the hypotheses from the version space (the majority that voted
incorrectly), and the version space starts with $|H|$ hypotheses.
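A minimal sketch of the Halving Algorithm over an explicitly enumerated finite $H$ (explicit enumeration is only feasible for small hypothesis spaces; the toy class and all names are illustrative):
\begin{verbatim}
# Halving Algorithm: keep the version space explicitly, predict by majority
# vote, and after each example discard every inconsistent hypothesis.

def halving(hypotheses, examples):
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in examples:
        votes = sum(h(x) for h in version_space)
        prediction = int(2 * votes >= len(version_space))  # majority (ties -> 1)
        if prediction != label:
            mistakes += 1
        # Candidate-Elimination step: keep only consistent hypotheses.
        version_space = [h for h in version_space if h(x) == label]
    return mistakes

# Toy H: threshold concepts over {0,...,7}; hypothetical target threshold 5.
H = [lambda x, t=t: int(x >= t) for t in range(9)]
data = [(x, int(x >= 5)) for x in (3, 7, 4, 6, 5, 0)]
print(halving(H, data))    # mistakes made; at most log2|H|, here at most 3
\end{verbatim}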
5.3 Optimal Mistake Bounds
Let $M_{A}(C)$ be the maximum number of mistakes made by algorithm $A$ to learn
concepts in $C$ (the maximum is taken over all possible $c \in C$ and all
possible training sequences):
\[M_{A}(C) \equiv \max_{c \in C} M_{A}(c)\]
Let $C$ be an arbitrary non-empty concept class. The
optimal mistake bound for $C$, denoted $Opt(C)$, is the minimum
over all possible learning algorithms $A$ of $M_{A}(C)$. \[Opt(C)
\equiv \min_{A \in \mbox{learning algorithms}} M_{A}(C) \]
We can relate this to the VC dimension with
\[ VC(C) \leq Opt(C) \leq M_{Halving}(C) \leq \log_{2}|C|. \]
6 Summary
Research in Computational Learning Theory has led to a
better understanding of what is learnable and how fast.
It has also led to the development and improvement of
learning algorithms.
Like the other areas we have studied, it is an active research area
with its own conference (COLT) [6].